Projects in Data Science
Activity A3b: Joins
Additional reading: Wickham and Grolemund on relational data or Chapter 10 of Data Computing by Kaplan
A join is a data verb that combines two tables.
There are several kinds of join.
A match between a case in the left table and a case in the right table is made based on the values in pairs of corresponding variables.
As an example, we’ll examine the following two tables on grades and courses. The Grades file has one case for each class of each student, and includes variables describing the ID of the student (sid), the ID of the session (section), and the grade received. The Courses table has variables for the ID of the session (section), the department (coded), the level, the semester, the enrollment, and the ID of the instructor (iid). We show a few random rows of each table below.
Table 1: Student grades.
| sid | sessionID | grade |
|---|---|---|
| S31680 | session3521 | A- |
| S31242 | session2933 | C+ |
| S32127 | session2046 | B |
| S32058 | session2521 | B+ |
Table 2: Information about each course section.
| sessionID | dept | level | sem | enroll | iid |
|---|---|---|---|---|---|
| session2568 | J | 100 | FA2002 | 15 | inst223 |
| session1940 | d | 100 | FA2000 | 16 | inst409 |
| session3242 | m | 200 | SP2004 | 30 | inst476 |
| session3132 | M | 100 | SP2004 | 18 | inst257 |
The first class of joins are mutating joins, which add new variables (columns) to the left data table from matching observations in the right table.1 There is also a right_join() that adds variables in the reverse direction from the left table to the right table, but we do not really need it as we can always switch the roles of the two tables.
The main difference in the three mutating join options in this class is how they answer the following questions:
Three mutating join functions:
left_join(): the output has all cases from the left, regardless if there is a match in the right, but discards any cases in the right that do not have a match in the left.inner_join(): the output has only the cases from the left with a match in the right.full_join(): the output has all cases from the left and the right. This is less common than the first two join operators.When there are multiple matches in the right table for a particular case in the left table, all three of these mutating join operators produce a separate case in the new table for each of the matches from the right.
One of the most common and useful mutating joins in one that translates levels of a variable to a new scale. For example, below we’ll see a join that translates letter grades (e.g., “B”) into grade points (e.g., 3).
Example 1 (Average class size: varying viewpoints) Determine the average class size from the viewpoint of a student and the viewpoint of the Provost / Admissions Office.
Solution. The Provost counts each section as one class and takes the average of all classes. We have to be a little careful and cannot simply do mean(Courses$enroll), because some sessionIDs appear twice on the course list. Why is that?2 They are courses that are cross-listed in multiple departments! We can still do this from the data we have in the Courses table, but we should aggregate by sessionID first:
CourseSizes<-Courses %>%
group_by(sessionID) %>%
summarise(total_enroll=sum(enroll))
mean(CourseSizes$total_enroll)
## [1] 21.45251
To get the average class size from the student perspective, we can join the enrollment of the section onto each instance of a student section. Here, the left table is Grades, the right table is CourseSizes, we are going to match based on sessionID, and we want to add the variable total_enroll. We’ll use a left_join since we aren’t interested in any sections from the CourseSizes table that do not show up in the Grades table; their enrollments should be 0, and they are not actually seen by any students. Note, e.g., if there were 100 extra sections of zero enrollments on the Courses table, this would change the average from the Provost’s perspective, but not at all from the students’ perspective.
If the by= is omitteed from a join, then R will perform a natural join, which matches the two table by all variables they have in common. In this case, the only variable in common is the sessionID, so we would get the same results by omitting the second argument. In general, this is not reliable unless we check ahead of time which variables the tables have in common. If two variables to match have different names in the two tables, we can write by=c(“name1”=“name2”).
EnrollmentsWithClassSize <- Grades %>%
left_join(CourseSizes,
by=c("sessionID"="sessionID")) %>%
select(sid,sessionID,total_enroll)
Table 3: Student enrollments with the section size.
| sid | sessionID | total_enroll |
|---|---|---|
| S31680 | session3521 | 12 |
| S31242 | session2933 | 18 |
| S32127 | session2046 | 34 |
| S32058 | session2521 | 15 |
AveClassEachStudent<-EnrollmentsWithClassSize %>%
group_by(sid) %>%
summarise(ave_enroll = mean(total_enroll, na.rm=TRUE))
Table 4: Average class size seen by each individual student.
| sid | ave_enroll |
|---|---|
| S31677 | 23.83333 |
| S31242 | 26.72727 |
| S32121 | 23.33333 |
| S32052 | 23.63636 |
The na.rm=TRUE here says that if the class size is not available for a given class, we do not count that class towards the student’s average class size. What is another way to capture the same objective? We could have used an inner_join instead of a left_join when we joined the tables to eliminate any entries from the left table that did not have a match in the right table.
Now we can take the average of the AveClassEachStudent table, counting each student once, to find the average class size from the student perspective:
mean(AveClassEachStudent$ave_enroll)
## [1] 24.41885
We see that the average size from the student perspective (24.4) is greater than the average size from the Provost’s perspective (21.5). It is a fun probability exercise to prove that this fact is always true!!
The second class of joins are filtering joins, which select specific cases from the left table based on whether they match an observation in the right table.
semi_join(): discards any cases in the left table that do not have a match in the right table. If there are multiple matches of right cases to a left case, it keeps just one copy of the left case.anti_join(): discards any cases in the left table that have a match in the right table.A particularly common employment of these joins is to use a filtered summary as a comparison to select a subset of the original cases, as follows.
Example 2 (semi_join to compare to a filtered summary) Find a subset of the Grades data that only contains data on the four largest sections in the Courses data set.
Solution.
LargeSections<-Courses %>%
group_by(sessionID) %>%
summarise(total_enroll=sum(enroll)) %>%
top_n(total_enroll, n=4)
GradesFromLargeSections <- Grades %>%
semi_join(LargeSections)
Example 3 (semi_join) Use semi_join() to create a table with a subset of the rows of Grades corresponding to all classes taken in department J.
Solution. There are multiple ways to do this. We could do a left join to the Grades table to add on the dept variable, and then filter by department, then select all variables except the additional dept variable we just added. Here is a more direct way with semi_join that does not involve adding and subtracting the extra variable:
JCourses <- Courses %>%
filter(dept=="J")
JGrades <- Grades %>%
semi_join(JCourses)
Let’s double check this worked. Here are the first few entries of our new table:
Table 5: Student enrollments in Department J.
| sid | sessionID | grade |
|---|---|---|
| S31185 | session1791 | A- |
| S31185 | session1792 | B+ |
| S31185 | session1794 | B- |
| S31185 | session1795 | C+ |
The first entry is for session1791. Which department is that?
(Courses%>%filter(sessionID=="session1791"))
## # A tibble: 1 x 6
## sessionID dept level sem enroll iid
## <chr> <chr> <int> <chr> <int> <chr>
## 1 session1791 J 100 FA1993 22 inst223
But that only checked the first one. What if we want to double check all of the courses included in Table 5? We can add on the department and do a group by to count the number from each department in our table.
JGrades %>%
left_join(Courses) %>%
group_by(dept) %>%
summarise(total=n())
## # A tibble: 1 x 2
## dept total
## <chr> <int>
## 1 J 148
Exercise 1 (Which to join?) For each of these questions, say what tables you need to join and identify the corresponding variables.
No need to join table. We can just use the Courses table and then group_by %>% summarise
We merge Grades and Courses by sessionID, so that we could identify the semester for each student. Then we group_by students and semester, and then average the grades.
We only need to look at Grades table, assuming we are doing this for the whole school across all time. If we would like to do it by terms or departments, then we need to merge Courses with Grades. Then we filter and look at grades.
Again, merge Grades with Courses by sessionID. Then we group_by department or instructor, and take the average of grades.
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
Two data tables are available:
Trips contains records of individual rentalsStations gives the locations of the bike rental stations
Figure 1: A typical Capital Bikeshare station. This one is at Florida and California, next to Pleasant Pops.
Figure 2: One of the vans used to redistribute bicycles to different stations.
Here is the code to read in the data:3 Important: To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.
data_site <-
"https://tiny.cc/dcf/2014-Q4-Trips-History-Data-Small.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations<-read_csv("https://tiny.cc./dcf/DC-Stations.csv")
The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you can access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
It’s natural to expect that bikes are rented more at some times of day, some days of the week, sme months of the year than others. The variable sdate gives the time (including the date) that the rental started.
Exercise 2 (Single variable temporal plots) Make the following plots and interpret them:
sdate. Use ggplot() and geom_density().mutate with lubridate::hour(), and lubridate::minute() to extract the hour of the day and minute within the hour from sdate.ggplot(Trips,aes(sdate))+
geom_density()+
labs(x="Date",title="Rentals by Date",subtitle="Washington D.C. Bike-renting")
Trips <- Trips %>%
mutate(daytime=(hour(sdate)+minute(sdate)/60))
ggplot(Trips,aes(daytime))+
labs(x="Hours",title="Rentals by hours",subtitle="Washington D.C. Bike-renting")+
geom_density()
Trips <- Trips %>%
mutate(weekday=wday(sdate))
c <- ggplot(Trips,aes(x=weekday))+
geom_density()+
labs(x="Week Day",title="Rentals by Week Day",subtitle="Washington D.C. Bike-renting")
c
ggplot(Trips,aes(daytime))+
labs(x="Hours",title="Rentals by hours",subtitle="Washington D.C. Bike-renting")+
geom_density()+
facet_grid(weekday~.)
People go out earlier during the weekends!!!!
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Causal). Do you think these two different categories of users show different rental behavior? How might it interact with the patterns you found in Exercise 2?
Exercise 3 (Customer segmentation) Repeat the graphic from Exercise 2 (d) with the following changes:
fill aesthetic for geom_density() to the client variable. You may also want to set the alpha for transparency and color=NA to suppress the outline of the density function.position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?mutate(wday = ifelse(lubridate::wday(sdate) %in% c(1,7), "weekend", "weekday")). What does the variable wday represent? Try to understand the code.wday and fill with client, or vice versa?ggplot(Trips,aes(daytime))+
labs(x="Hours",title="Rentals by hours",subtitle="Washington D.C. Bike-renting")+
geom_density(aes(fill=client),alpha=0.2)
ggplot(Trips,aes(daytime))+
labs(x="Hours",title="Rentals by hours",subtitle="Washington D.C. Bike-renting")+
geom_density(aes(fill=client),alpha=0.2,position=position_stack())
It depends. If we would like to observe the distribution of each client group, the first one is better. But the second one tells a good story about how the fraction between the two groups change during the day within the overall distribution.
Trips <- Trips %>%
mutate(wday = ifelse(lubridate::wday(sdate) %in% c(1,7), "weekend", "weekday"))
The code separate weekend and weekday.
c<-ggplot(Trips,aes(daytime))+
labs(x="Hours",title="Rentals by hours",subtitle="Washington D.C. Bike-renting")+
geom_density(aes(fill=client),alpha=0.2)+
facet_grid(wday~.)
c
ggplot(Trips,aes(daytime))+
labs(x="Hours",title="Rentals by hours",subtitle="Washington D.C. Bike-renting")+
geom_density(aes(fill=wday),alpha=0.2)+
facet_grid(client~.)
I prefer the facet by weekday, so that I could compare the two types of clients. But each tells a different story.
I really like the facet by weekday, fill by client. It shows the different in behaviors between client types, weekday and weekend. Basically, casual renters behave like registered renters during weekend.
c
Stations to make a visualization of the total number of departures from each station in the Trips data. To layer your data on top of a Google map, start your plotting code as follows:
totalTrips<-Trips%>%
group_by(sstation)%>%
summarise(totalDep=n())
Stations <- Stations %>%
left_join(totalTrips,by=c("name"="sstation"))
myMap <- get_map(location="Logan Circle",source="google",maptype="roadmap",zoom=13)
ggmap(myMap)+
geom_point(data=Stations,aes(x=long, y=lat, color=totalDep,size=totalDep),alpha=0.6)+
scale_colour_gradient(low="#323FFA",high="#D50000")+
labs(title="Departures from Each Stations",subtitle="Washington D.C. Rental Bikes")
## Warning: Removed 165 rows containing missing values (geom_point).
Exercise 5 Only 14.4% of the trips in our data are carried out by casual users.4 We can compute this statistic via mean(Trips$client=="Casual"). Create a map that shows which area(s) of the city have stations with a much higher percentage of departures by casual users. Interpret your map.
casualTrips<-Trips%>%
group_by(sstation)%>%
summarise(casual.fraction=mean(client=="Casual"))
Stations <- Stations%>%
left_join(casualTrips,by=c("name"="sstation"))
ggmap(myMap)+
geom_point(data=Stations,aes(x=long, y=lat, color=casual.fraction,size=casual.fraction),alpha=0.6)+
scale_colour_gradient(low="#323FFA",high="#D50000")+
labs(title="Casual Client Departures from Each Stations",subtitle="Washington D.C. Rental Bikes",
color="Fraction of Casual Clients",size="Fraction of Casual Clients")
## Warning: Removed 165 rows containing missing values (geom_point).
We could see that the areas with most percentage of casual clients are near the parks and tourist areas.
Exercise 6 (High traffic points) 5 Hint for part(a): as_date(sdate) converts sdate from date-time format to date format.
wday (weekend/weekday), and count the total number of trips in each of the four groups. Interpret your results.
Trips <- Trips %>%
mutate(sday=date(sdate))
top10Dep <- Trips%>%
group_by(sstation,sday)%>%
summarise(totalDep=n())%>%
arrange(desc(totalDep))%>%
head(n=10)
knitr::kable(
top10Dep[,],caption="Top 10 Departures By Day"
)
Table 6: Top 10 Departures By Day
| sstation | sday | totalDep |
|---|---|---|
| Columbus Circle / Union Station | 2014-11-12 | 11 |
| Jefferson Dr & 14th St SW | 2014-12-27 | 9 |
| Lincoln Memorial | 2014-10-05 | 9 |
| Lincoln Memorial | 2014-10-09 | 8 |
| 17th St & Massachusetts Ave NW | 2014-10-06 | 7 |
| Columbus Circle / Union Station | 2014-10-02 | 7 |
| Georgetown Harbor / 30th St NW | 2014-10-25 | 7 |
| Massachusetts Ave & Dupont Circle NW | 2014-10-01 | 7 |
| New Hampshire Ave & T St NW | 2014-10-16 | 7 |
| 14th & V St NW | 2014-11-07 | 6 |
top10Trips <- Trips %>%
semi_join(top10Dep,by=c("sstation"="sstation","sday"="sday"))
top10Client <- top10Trips %>%
group_by(client,wday)%>%
summarise(totalDep=n())
ggplot(top10Client,aes(x=client,y=totalDep,fill=wday))+
geom_col(position="stack")+
labs(title="Spatiotemporal Renting Pattern",subtitle="For Top 10 Departures Location & Day",
fill="Day Type",x="Client Type",y="Total Departures")
Among the top 10 combination of day & location departures, most clients are registered and rent bikes on weekday. Casual clients in this group mostly rent bike during weekend.